1 Abstract

2 Introduction

3 Methods

First of all, a new directory in the DNA High-Performance Computing Cluster was created, which is

/mnt/Timina/bioinfoII/amarin/RNAseq2/

This directory has the next organization:

|-data/           # Directory to download the raw data
|-data_trimmed/   # Directory to save the post-QC data
|-kallisto_out/   # Directory to save the kallisto outputs
|-QC/             # Directory to perform raw QC
|-QC_trimmed/     # Directory to perform QC on the trimmed data
|-scripts/        # Some scripts used to send jobs

3.1 Download

The dataset choose to analysis has de GEO Accession GSE91061. Since downloading from the NCBI may be difficult and quite technical, it was choosen to use the European mirror, the European Nucleotide Archive from the EMBL-EBI. By browing on the GEO web page, we see that the data are from the BioProject PRJNA356761. We use that ID to search on ENA.

Due the fact that the are a lot of file needed to be downloaded, we use the web application SRA Explorer to get a script to get the bulk download.

We then select all the files and click on “Add to collection”. After that, we see our saves datasets and select the bash script for downloading FastQ files option. We save the script with the name downloadScript.sh, and save it in the HPCC, in the //data/ directory.

Then, we send a job using the script. It lasted around 5 days.

3.2 Raw Quality Control

Once the data is ready, we perform quality control. We just send a job using fastqc with all the fastq files via a wildcard:

fastqc /mnt/Timina/bioinfoII/amarin/RNAseq2/data/*fastq.gz -o /mnt/Timina/bioinfoII/amarin/RNAseq2/QC

Then, we used multiqc to visualize all the samples at once:

multiqc /mnt/Timina/bioinfoII/amarin/RNAseq2/QC -o /mnt/Timina/bioinfoII/amarin/RNAseq2/QC

This allowed us to see some irregularities. First, the first to nucleotides of each read had a smaller mean quality score, compared to the rest of the read.

Besides, in the heatmap, the base sequence sequence content seemed to be biased on the first base pairs of the read.

All the samples resembled more or less the next percentage of base content.

Nevertheless, MultiQC didn’t detect any adapter content. Based on that, we decided to trimm the first 12 nucleotides of each pair.

3.3 Trimming

We send a job in the //data/ directory with a script using the software Trimmomatic:

for i in *_1.fastq.gz;
    do echo
    trimmomatic PE -threads 40 -phred33 -trimlog trim.log \
        $i "${i%_1.fastq.gz}_2.fastq.gz" \
        ../data_trimmed/"${i%_1.fastq.gz}_1_trimmed.fastq.gz" \
        ../data_trimmed/"${i%_1.fastq.gz}_1_unpaired.fastq.gz" \
        ../data_trimmed/"${i%_1.fastq.gz}_2_trimmed.fastq.gz" \
        ../data_trimmed/"${i%_1.fastq.gz}_2_unpaired.fastq.gz" \
        HEADCROP:12
  done

Here we just cut the first 12 nucleotides of each read and create a trim.log log file by the use of 40 threads. We know that the Phred score are based on ASCII 33 because previously we found a # symbol on the reads.

This job took a couple of days.

3.4 Trimmed Quality Control

Once the data was trimmed, we did one more time the quality control.

fastqc /mnt/Timina/bioinfoII/amarin/RNAseq2/data_trimmed/*fastq.gz \
  -o /mnt/Timina/bioinfoII/amarin/RNAseq2/QC_trimmed
multiqc /mnt/Timina/bioinfoII/amarin/RNAseq2/QC_trimmed \
  -o /mnt/Timina/bioinfoII/amarin/RNAseq2/QC_trimmed

This time, the quality of the first pair bases was better.

Besides, the “per base sequence content” was less biased:

Each sample was more homogeneous.

Besides, other indicators, such as the “per base N content” and “per sequence GC content” performed better.

3.5 Pseudoaligment to reference genome

Once the data was ready, we decided to generate count matrices using Kallisto. To perform this, we first need a reference transcriptome. We choose the latest version available at GENCODE, which is the release 43. The file was downloaded at the parent directory of this project:

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.transcripts.fa.gz

Then, generated the kallisto index:

kallisto index -i /mnt/Timina/bioinfoII/amarin/RNAseq2/kallisto_out/Hs_ref.kidx \
  /mnt/Timina/bioinfoII/amarin/RNAseq2/gencode.v43.transcripts.fa.gz

Then, we used a bash script to perform the quantification with all the samples.

for file in /mnt/Timina/bioinfoII/amarin/RNAseq2/data_trimmed/*_1_trimmed.fastq.gz
do
    op=$(basename $file | cut -c-10 )         
    file2=$(echo $file | sed 's/_1_/_2_/')  
    mkdir /mnt/Timina/bioinfoII/amarin/RNAseq2/kallisto_out/${op}
    
    kallisto quant -i /mnt/Timina/bioinfoII/amarin/RNAseq2/kallisto_out/Hs_ref.kidx \
    -o /mnt/Timina/bioinfoII/amarin/RNAseq2/kallisto_out/${op} \
    --threads 40 $file $file2
done

3.6 Differencial Expresion

For the differential expression analysis we used DESeq2, a Bioconductor package running on R 4.3.0 using a PC. The next, are the libraries to use.

library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(tximport)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.1.8
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks plotly::filter(), stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(DESeq2)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## 
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## 
## 
## Attaching package: 'S4Vectors'
## 
## The following objects are masked from 'package:lubridate':
## 
##     second, second<-
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## 
## The following object is masked from 'package:tidyr':
## 
##     expand
## 
## The following object is masked from 'package:plotly':
## 
##     rename
## 
## The following object is masked from 'package:utils':
## 
##     findMatches
## 
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## 
## The following object is masked from 'package:lubridate':
## 
##     %within%
## 
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## 
## The following object is masked from 'package:purrr':
## 
##     reduce
## 
## The following object is masked from 'package:plotly':
## 
##     slice
## 
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## 
## The following object is masked from 'package:dplyr':
## 
##     count
## 
## 
## Attaching package: 'MatrixGenerics'
## 
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## 
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## 
## Attaching package: 'Biobase'
## 
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## 
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
## Warning: replacing previous import 'S4Arrays::read_block' by
## 'DelayedArray::read_block' when loading 'SummarizedExperiment'
library(ggplot2)
library(ggrepel) 
library(rhdf5)
library(GenomicFeatures)
## Loading required package: AnnotationDbi
## 
## Attaching package: 'AnnotationDbi'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## The following object is masked from 'package:plotly':
## 
##     select
library(dplyr)
library(biomaRt)
library(gprofiler2)
library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.16.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite either one:
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##     genomic data. Bioinformatics 2016.
## 
## 
## The new InteractiveComplexHeatmap package can directly export static 
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
## 
## 
## Attaching package: 'ComplexHeatmap'
## 
## The following object is masked from 'package:plotly':
## 
##     add_heatmap
library(genefilter)
## 
## Attaching package: 'genefilter'
## 
## The following object is masked from 'package:ComplexHeatmap':
## 
##     dist2
## 
## The following objects are masked from 'package:MatrixGenerics':
## 
##     rowSds, rowVars
## 
## The following objects are masked from 'package:matrixStats':
## 
##     rowSds, rowVars
## 
## The following object is masked from 'package:readr':
## 
##     spec
library(clusterProfiler)
## 
## clusterProfiler v4.8.1  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
## 
## If you use clusterProfiler in published research, please cite:
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
## 
## Attaching package: 'clusterProfiler'
## 
## The following object is masked from 'package:biomaRt':
## 
##     select
## 
## The following object is masked from 'package:AnnotationDbi':
## 
##     select
## 
## The following object is masked from 'package:IRanges':
## 
##     slice
## 
## The following object is masked from 'package:S4Vectors':
## 
##     rename
## 
## The following object is masked from 'package:purrr':
## 
##     simplify
## 
## The following object is masked from 'package:stats':
## 
##     filter
library(AnnotationDbi)
library(org.Hs.eg.db)
## 
#library(ggupset)
library(enrichplot)

3.6.1 Data importation

First, we imported the kallisto quantification tables to a local computer from the DNA HPCC. Then, we created a tx2gene table, which maps annotated genes with transcripts. To do this, we used the Bioconductor package biomaRt.

ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")


tx2gene = getBM(attributes=c('ensembl_transcript_id','ensembl_gene_id'), 
                   mart = ensembl)

Next, we import the quantification table to the R environment. To do this, we use the Bioconductor package rhdf5

## Paths to h5 archives
files = file.path(list.dirs("./kallisto_data", recursive = FALSE),
                  "abundance.h5")
names(files) = str_extract(files, "SRR\\d+")

## Importation
txi.kallisto.tsv = tximport(files, type = "kallisto", tx2gene = tx2gene,
                            ignoreAfterBar = TRUE, ignoreTxVersion=TRUE)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 
## summarizing abundance
## summarizing counts
## summarizing length
head(txi.kallisto.tsv$counts)
##                 SRR5088813 SRR5088814 SRR5088815 SRR5088816 SRR5088817
## ENSG00000000003  1499.8897   1031.487  1312.8342  1808.0254  1579.2045
## ENSG00000000005     1.0000      0.000     0.0000     2.0000     0.0000
## ENSG00000000419  1861.0942   2168.997  1508.2203  1496.6707  2454.8492
## ENSG00000000457   963.2173   1055.715   641.8061   747.8741   556.1578
## ENSG00000000460  1011.8159   1148.560  1076.2147  1276.6267   515.2332
## ENSG00000000938  1112.0000    620.000  1809.0000  1215.0000   431.0000
##                 SRR5088818 SRR5088819 SRR5088820 SRR5088821 SRR5088822
## ENSG00000000003  1574.5759   5353.023  3867.1593  1036.6527  2788.7407
## ENSG00000000005     2.0000      0.000     4.0000     6.0000     0.0000
## ENSG00000000419  2549.8278   1708.095  2677.1232  2612.1233  1646.1536
## ENSG00000000457   524.0493   1587.810   833.4179   895.6544   739.3859
## ENSG00000000460   455.2848   1364.916   491.6654   946.9077   417.4794
## ENSG00000000938   992.0000     76.000   484.0000  1257.0000   661.0000
##                 SRR5088823 SRR5088824 SRR5088825 SRR5088826 SRR5088827
## ENSG00000000003  2868.4659   817.9287   294.2066  2170.6283  2477.1213
## ENSG00000000005     0.0000   187.0000    25.0000     3.0000     0.0000
## ENSG00000000419  1504.1887  1795.1807  1864.4394 13570.2126  3901.6615
## ENSG00000000457   931.6242   923.3614  1009.3774  1114.0828   698.9271
## ENSG00000000460   260.2280   346.5751   253.3507   716.0371   626.8997
## ENSG00000000938   483.0000   807.0000  3836.0000   175.0000  1183.0000
##                 SRR5088828 SRR5088829 SRR5088830 SRR5088831 SRR5088832
## ENSG00000000003  2961.8483  1959.4511  1420.5087  1617.4379  1480.6072
## ENSG00000000005     3.0000     0.0000     5.0000    15.0000     5.0000
## ENSG00000000419  3392.4707  3074.5196  2252.7432  2081.0145  1921.6448
## ENSG00000000457   912.6742   929.7829   748.5641   924.7083   679.3888
## ENSG00000000460   696.6745   635.6388   513.0464   578.0808   405.9423
## ENSG00000000938   618.0000    84.0000   176.0000   456.0000   498.0000
##                 SRR5088833 SRR5088834 SRR5088835 SRR5088836 SRR5088837
## ENSG00000000003   1367.503   540.4137   530.0340   981.7885   1221.375
## ENSG00000000005      0.000     0.0000     2.0000     3.0000      0.000
## ENSG00000000419   1967.910  2395.5939  2627.1136  1937.1176   1863.499
## ENSG00000000457   1006.676  1601.0491  1999.9430   879.7708   1003.785
## ENSG00000000460   1001.351   548.7528   715.2732   580.5069    613.107
## ENSG00000000938    461.000  3301.0000  3747.0000   886.0000    488.000
##                 SRR5088838 SRR5088839 SRR5088840 SRR5088841 SRR5088842
## ENSG00000000003  5246.5285  4371.1350  1386.6904  1667.9434  2370.8907
## ENSG00000000005     0.0000     0.0000     1.0000     0.0000     4.0000
## ENSG00000000419  2203.0075  1880.9251  1070.5892  1093.1452  4496.7546
## ENSG00000000457   920.7094   837.1885   611.6103   584.0139   574.4228
## ENSG00000000460   526.7809   448.6499   130.6015   190.1884   282.7219
## ENSG00000000938   169.0000   161.0000    43.0000    50.0000   574.0000
##                 SRR5088843 SRR5088844 SRR5088845 SRR5088846 SRR5088847
## ENSG00000000003  1773.1713  1737.9675   586.3445   917.7317   734.9708
## ENSG00000000005     9.0000     5.0000     0.0000     7.0000    18.0000
## ENSG00000000419  8293.1696  2034.9443  2283.5735  1823.0789  2212.8519
## ENSG00000000457   672.9638  1112.1147   665.1059  1094.2700  1346.8390
## ENSG00000000460   483.8449   268.2798   233.3754   611.2008   533.3934
## ENSG00000000938   455.0000   737.0000   570.0000   850.0000  1266.0000
##                 SRR5088848 SRR5088849 SRR5088850 SRR5088851 SRR5088852
## ENSG00000000003  6306.3849   5074.493   606.6871   866.3972   862.9225
## ENSG00000000005    34.0000      0.000     6.0000    18.0000     0.0000
## ENSG00000000419  3604.5185   3582.646  2216.1319  2180.0030  2046.2397
## ENSG00000000457  1556.8109   1279.719   633.8067   565.3440  1153.2206
## ENSG00000000460   892.5964   1244.861   314.9208   345.0006   594.8391
## ENSG00000000938   131.0000    106.000    48.0000    53.0000  1199.0000
##                 SRR5088853 SRR5088854 SRR5088855 SRR5088856 SRR5088857
## ENSG00000000003  2671.1411  2074.6614  3658.8181  2655.2742   2136.568
## ENSG00000000005     0.0000     0.0000     0.0000     0.0000      0.000
## ENSG00000000419  1734.9773  2122.9492  3652.9663  4202.0157   3091.590
## ENSG00000000457   917.4060   914.4429   941.8504   779.9289   1108.435
## ENSG00000000460   335.7866   365.9461   555.1274   483.9068   1349.219
## ENSG00000000938   246.0000   704.0000   151.0000   235.0000    457.000
##                 SRR5088858 SRR5088859 SRR5088860 SRR5088861 SRR5088862
## ENSG00000000003  1720.1813  2785.8307   295.9765   53.23744  3496.2914
## ENSG00000000005     0.0000     0.0000     0.0000    7.00000   307.0000
## ENSG00000000419  2762.7303  3231.6175  1099.6812 5575.89804  1813.6622
## ENSG00000000457   937.2847   920.3392   450.4721  749.86228   887.6986
## ENSG00000000460  1437.1010  1129.5838   223.7868  691.37841   398.9935
## ENSG00000000938   499.0000   143.0000  5332.0000  351.00000  1364.0000
##                 SRR5088864 SRR5088865 SRR5088866 SRR5088867 SRR5088870
## ENSG00000000003  3980.2444   1643.481  2584.2381   1452.240  1103.5252
## ENSG00000000005     0.0000      7.000     0.0000      0.000     6.0000
## ENSG00000000419  2347.0594   6531.794  1792.6321   1596.549  2176.1678
## ENSG00000000457   810.3796   1668.369   652.4703   1188.964  1130.4216
## ENSG00000000460  1294.1503   1855.273   630.7542    775.583   708.0209
## ENSG00000000938    14.0000   1654.000  1222.0000    470.000   744.0000
##                 SRR5088871 SRR5088872 SRR5088873 SRR5088874 SRR5088877
## ENSG00000000003   791.5707   507.0273  1194.8276   996.4261  2134.4051
## ENSG00000000005    15.0000    55.0000     1.0000     9.0000     4.0000
## ENSG00000000419  2004.4437  1322.5974  2261.5769  1841.4780  2490.3217
## ENSG00000000457   847.4552   532.1253   767.3336   673.1486  1002.7374
## ENSG00000000460   374.4083   135.0421   448.1326   333.8946   599.8539
## ENSG00000000938   187.0000  2330.0000   113.0000   161.0000  1105.0000
##                 SRR5088878 SRR5088879 SRR5088880 SRR5088882 SRR5088883
## ENSG00000000003   2213.230   1676.773  2196.0660   1786.568  1225.2293
## ENSG00000000005      0.000    260.000     0.0000      0.000     7.0000
## ENSG00000000419   2556.585   2974.618  4423.9343   2419.445  1679.1308
## ENSG00000000457   1090.094   1409.785  1472.7298   1269.951   900.6896
## ENSG00000000460   1428.240    524.393   807.6177   1119.651   787.5410
## ENSG00000000938    336.000   1190.000   915.0000   3033.000  1744.0000
##                 SRR5088885 SRR5088886 SRR5088887 SRR5088888 SRR5088889
## ENSG00000000003  2409.6886  2392.8711   6001.996   2483.610  1895.6341
## ENSG00000000005    98.0000    51.0000      5.000      0.000    44.0000
## ENSG00000000419  2047.6542  2265.3464   3124.770   1829.664  1609.2291
## ENSG00000000457   834.1540   897.3309   3008.593   1684.672  1387.6998
## ENSG00000000460   655.1501   816.8001   3229.875   1580.403   508.6059
## ENSG00000000938   357.0000   301.0000    253.000    861.000    43.0000
##                 SRR5088890 SRR5088891 SRR5088892   SRR5088893 SRR5088894
## ENSG00000000003  1833.9203   333.4795   376.6351     2.085262  2429.6335
## ENSG00000000005    13.0000     0.0000     8.0000     0.000000     0.0000
## ENSG00000000419  1904.2238  1964.2980  2191.3371  1168.466344  2297.8610
## ENSG00000000457  1322.3372   976.8201  1450.7139   310.014678  1070.1167
## ENSG00000000460   570.3542   636.9625   763.8163    61.429664   712.8822
## ENSG00000000938    41.0000  1248.0000  1927.0000 23942.000000   164.0000
##                 SRR5088895 SRR5088896 SRR5088897 SRR5088898 SRR5088899
## ENSG00000000003  2278.8957  3603.5690  3056.4739  1045.5592  1010.5119
## ENSG00000000005     1.0000     0.0000     1.0000   154.0000     0.0000
## ENSG00000000419  2068.4917  1153.9548   996.5107  1283.3244  1624.0418
## ENSG00000000457   916.8713  1081.8873   935.1838   683.9812   653.9129
## ENSG00000000460   657.7822   497.3166   377.2150   360.0222   560.0213
## ENSG00000000938    92.0000   384.0000   319.0000  1804.0000  6096.0000
##                 SRR5088900 SRR5088903 SRR5088904 SRR5088905 SRR5088906
## ENSG00000000003  1902.6976  2266.9754   578.9869   554.5478   538.7689
## ENSG00000000005    11.0000     4.0000     8.0000    12.0000   117.0000
## ENSG00000000419  1521.8528  3423.1362  1390.3848  1377.5183  2213.9150
## ENSG00000000457  1020.4122  1147.7186   568.0432   459.7878   538.8831
## ENSG00000000460   611.4256   731.3597   355.8647   243.3910   360.3783
## ENSG00000000938   412.0000   890.0000  5539.0000  6648.0000  6147.0000
##                 SRR5088907 SRR5088908 SRR5088909 SRR5088910 SRR5088911
## ENSG00000000003  1429.9228  1125.8278  5927.2194   452.8942   3135.619
## ENSG00000000005     6.0000    17.0000     0.0000     0.0000      2.000
## ENSG00000000419  2277.1880  2124.2340  1776.4331  1367.2127   1670.511
## ENSG00000000457   764.3815   720.4699   918.0422   773.0971   1974.626
## ENSG00000000460   737.4045   642.1742  1320.0837   242.6844   2198.253
## ENSG00000000938  1322.0000  1597.0000   743.0000  5387.0000    138.000
##                 SRR5088912 SRR5088913 SRR5088914 SRR5088915 SRR5088916
## ENSG00000000003  1165.7217  2273.3066  1903.7784  3313.8153  1987.0716
## ENSG00000000005     5.0000   132.0000     8.0000     0.0000    35.0000
## ENSG00000000419  1758.3253  1741.7684  1679.5253  2000.8187  2285.5451
## ENSG00000000457   836.2455   905.8524   900.0133   917.5498   998.4041
## ENSG00000000460   636.3842   219.3707   693.5904   778.4120   693.2936
## ENSG00000000938   819.0000   311.0000   368.0000   469.0000   975.0000
##                 SRR5088917 SRR5088918 SRR5088919 SRR5088920 SRR5088921
## ENSG00000000003  1958.6521  2816.8639  2825.0211   388.6590   820.3042
## ENSG00000000005     9.0000    18.0000     0.0000     5.0000    10.0000
## ENSG00000000419  1869.2832  2442.3880  3731.7272  1763.7044  1862.8944
## ENSG00000000457  1176.0046  1283.6872   574.4145  1214.9047  1164.8337
## ENSG00000000460   681.3052   679.8709   517.7314   622.6102   548.7739
## ENSG00000000938   619.0000   506.0000    67.0000    91.0000    55.0000
##                 SRR5088922 SRR5088923 SRR5088924 SRR5088925 SRR5088926
## ENSG00000000003  2232.9844  1757.3521   5112.921  1396.4045   2290.002
## ENSG00000000005    14.0000    53.0000   7064.000   130.0000     18.000
## ENSG00000000419  2172.8916  3236.8590   2611.180  1147.5355   1901.407
## ENSG00000000457   830.8325   770.4955   1333.935   449.7312   1691.924
## ENSG00000000460   823.1905   791.6204   1470.649   189.0400   1543.972
## ENSG00000000938   166.0000   756.0000    520.000   430.0000    374.000
##                 SRR5088927 SRR5088928 SRR5088929 SRR5088930
## ENSG00000000003   1976.776   452.9074   599.5753  1529.0688
## ENSG00000000005     18.000     8.0000     9.0000     0.0000
## ENSG00000000419   1825.588  1152.3027   978.7727  1482.7561
## ENSG00000000457   1423.880   584.0388   562.9419   617.0739
## ENSG00000000460   1130.280   263.8671   257.5823   648.2131
## ENSG00000000938    442.000  5434.0000  3327.0000  1344.0000

Next, we create a metadata table to know the treatment for each SRA sample. Since it wasn’t available on SRA Run Selector web page, we needed to manually create it.

## Metadata importation to R
meta = read.csv("./metadata.csv", header = FALSE)
samples = column_to_rownames(meta, var = "V1")

Now, we create a DESeq2 dataset-like object, called DESeqDataSet, using the imported counts, sample names and metadata table.

dds = DESeqDataSetFromTximport(txi.kallisto.tsv,
                               colData = samples,
                               design = ~V2)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## using counts and average transcript lengths from tximport

3.6.2 Exploratory Analysis

Before doing some exploration analysis, we pre-filter low counts.

## Pre-filtering

keep = rowSums(counts(dds)) >= 10
dds = dds[keep, ]

## Factor level
dds$V2 = factor(dds$V2, levels = c("Pre", "On"))

To identify if there’s a batch effect, we graph a PCA; normalizing the data with the variance stabilizing transformation.

vst = vst(dds, blind = FALSE)
## using 'avgTxLength' from assays(dds), correcting for library size

Once we hve the normalized data, we graph a PCA.

pca = prcomp(t(assay(vst)), rank. = 3)

components = pca[["x"]]
components = data.frame(components)

var = summary(pca)[["importance"]]['Proportion of Variance',]
var = 100*sum(var)

fig = plot_ly(components, x = ~PC1, y = ~PC2, z = ~PC3, color = dds$V2)
fig = fig %>% layout(title = "PCA")
fig
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

As we can observe, there is not a visible cluster suggesting a batch effect. In fact, by some reason, almos all the samples cluster into a big cloud, regardless to their treatment.

3.6.3 Differential Expression Analysis

Once we know there are not batch effects in our data, we perform the Differential
Expression Analysis.

dds = DESeq(dds)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 8421 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing

To consider significant genes, we will use a significance level of \(\alpha=0.01\), and a log fold change expression greater than 1 or lower than -1.

res = results(dds, alpha = 0.01)

## Significative genes
resSig = subset(res, padj < 0.01)
dim(resSig)
## [1] 80  6

We have 80 significant genes. We can visualize them in a volcano plot.

res.df = as.data.frame(res)

## Add genes symbols
res.df$symbols = mapIds(org.Hs.eg.db, keys = rownames(res.df),
                        keytype = "ENSEMBL", column = "SYMBOL")


### Up and down genes
res.df$diffexpressed = "NO"
res.df$diffexpressed[res.df$log2FoldChange > 1 & res.df$padj < 0.01] <- "UP"
res.df$diffexpressed[res.df$log2FoldChange < -1 & res.df$padj < 0.01] <- "DOWN"



ggplot(data = res.df, aes(x=log2FoldChange, y=-log10(padj), col=diffexpressed, label=symbols)) + 
  geom_point() 
## Warning: Removed 19353 rows containing missing values (`geom_point()`).

3.7 Gene Ontology Enrichment

To make the GO terms enrichment, we use the org.Hs.eg.db database, containing the annotated genes from Homo sapiens. The ontology was made using the Biologicla Process (BP).

genes = rownames(resSig[resSig$log2FoldChange >= 1, ])

GO = enrichGO(gene = genes,
              OrgDb = org.Hs.eg.db,
              keyType = "ENSEMBL", 
              ont = "BP")

GO.df = as.data.frame(GO)
head(GO.df)
##                    ID                  Description GeneRatio   BgRatio
## GO:0006953 GO:0006953         acute-phase response      7/60  58/20772
## GO:0042178 GO:0042178 xenobiotic catabolic process      5/60  30/20772
## GO:0006721 GO:0006721  terpenoid metabolic process      7/60 119/20772
## GO:0008202 GO:0008202    steroid metabolic process     10/60 363/20772
## GO:0002526 GO:0002526  acute inflammatory response      7/60 131/20772
## GO:0006805 GO:0006805 xenobiotic metabolic process      7/60 131/20772
##                  pvalue     p.adjust       qvalue
## GO:0006953 3.132619e-10 4.388799e-07 3.271114e-07
## GO:0042178 2.286453e-08 1.601660e-05 1.193769e-05
## GO:0006721 5.091813e-08 2.305916e-05 1.718674e-05
## GO:0008202 8.142519e-08 2.305916e-05 1.718674e-05
## GO:0002526 9.875443e-08 2.305916e-05 1.718674e-05
## GO:0006805 9.875443e-08 2.305916e-05 1.718674e-05
##                                                                                                                                                                     geneID
## GO:0006953                                                 ENSG00000055955/ENSG00000132693/ENSG00000197249/ENSG00000228278/ENSG00000229314/ENSG00000241635/ENSG00000257017
## GO:0042178                                                                                 ENSG00000140505/ENSG00000160868/ENSG00000165841/ENSG00000241635/ENSG00000255974
## GO:0006721                                                 ENSG00000025423/ENSG00000140505/ENSG00000160868/ENSG00000165841/ENSG00000241635/ENSG00000244122/ENSG00000248144
## GO:0008202 ENSG00000025423/ENSG00000073734/ENSG00000140505/ENSG00000160868/ENSG00000165841/ENSG00000175336/ENSG00000180432/ENSG00000241635/ENSG00000244122/ENSG00000255974
## GO:0002526                                                 ENSG00000055955/ENSG00000132693/ENSG00000197249/ENSG00000228278/ENSG00000229314/ENSG00000241635/ENSG00000257017
## GO:0006805                                                 ENSG00000073734/ENSG00000140505/ENSG00000160868/ENSG00000165841/ENSG00000241635/ENSG00000244122/ENSG00000255974
##            Count
## GO:0006953     7
## GO:0042178     5
## GO:0006721     7
## GO:0008202    10
## GO:0002526     7
## GO:0006805     7

4 Results

5 Conclusions

6 References

7 Session Information

sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Pop!_OS 22.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/Mexico_City
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] enrichplot_1.20.0           org.Hs.eg.db_3.17.0        
##  [3] clusterProfiler_4.8.1       genefilter_1.82.1          
##  [5] ComplexHeatmap_2.16.0       gprofiler2_0.2.1           
##  [7] biomaRt_2.56.0              GenomicFeatures_1.52.0     
##  [9] AnnotationDbi_1.62.1        rhdf5_2.44.0               
## [11] ggrepel_0.9.3               DESeq2_1.40.1              
## [13] SummarizedExperiment_1.30.1 Biobase_2.60.0             
## [15] MatrixGenerics_1.12.0       matrixStats_0.63.0         
## [17] GenomicRanges_1.52.0        GenomeInfoDb_1.36.0        
## [19] IRanges_2.34.0              S4Vectors_0.38.1           
## [21] BiocGenerics_0.46.0         lubridate_1.9.2            
## [23] forcats_1.0.0               stringr_1.5.0              
## [25] dplyr_1.1.0                 purrr_1.0.1                
## [27] readr_2.1.4                 tidyr_1.3.0                
## [29] tibble_3.1.8                tidyverse_2.0.0            
## [31] tximport_1.28.0             plotly_4.10.1              
## [33] ggplot2_3.4.1              
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.3.0            BiocIO_1.10.0            bitops_1.0-7            
##   [4] ggplotify_0.1.0          filelock_1.0.2           polyclip_1.10-4         
##   [7] XML_3.99-0.14            lifecycle_1.0.3          doParallel_1.0.17       
##  [10] lattice_0.21-8           MASS_7.3-59              crosstalk_1.2.0         
##  [13] magrittr_2.0.3           sass_0.4.5               rmarkdown_2.20          
##  [16] jquerylib_0.1.4          yaml_2.3.7               cowplot_1.1.1           
##  [19] DBI_1.1.3                RColorBrewer_1.1-3       zlibbioc_1.46.0         
##  [22] ggraph_2.1.0             RCurl_1.98-1.12          yulab.utils_0.0.6       
##  [25] tweenr_2.0.2             rappdirs_0.3.3           circlize_0.4.15         
##  [28] GenomeInfoDbData_1.2.10  tidytree_0.4.2           annotate_1.78.0         
##  [31] codetools_0.2-19         DelayedArray_0.26.2      DOSE_3.26.1             
##  [34] xml2_1.3.3               ggforce_0.4.1            tidyselect_1.2.0        
##  [37] shape_1.4.6              aplot_0.1.10             farver_2.1.1            
##  [40] viridis_0.6.3            BiocFileCache_2.8.0      GenomicAlignments_1.36.0
##  [43] jsonlite_1.8.4           GetoptLong_1.0.5         ellipsis_0.3.2          
##  [46] tidygraph_1.2.3          survival_3.5-5           iterators_1.0.14        
##  [49] foreach_1.5.2            tools_4.3.0              progress_1.2.2          
##  [52] treeio_1.24.0            Rcpp_1.0.10              glue_1.6.2              
##  [55] gridExtra_2.3            xfun_0.39                qvalue_2.32.0           
##  [58] withr_2.5.0              fastmap_1.1.0            rhdf5filters_1.12.1     
##  [61] fansi_1.0.4              digest_0.6.31            timechange_0.2.0        
##  [64] R6_2.5.1                 gridGraphics_0.5-1       colorspace_2.1-0        
##  [67] GO.db_3.17.0             RSQLite_2.3.1            utf8_1.2.3              
##  [70] generics_0.1.3           data.table_1.14.8        rtracklayer_1.60.0      
##  [73] prettyunits_1.1.1        graphlayouts_1.0.0       httr_1.4.4              
##  [76] htmlwidgets_1.6.2        S4Arrays_1.0.4           scatterpie_0.1.9        
##  [79] pkgconfig_2.0.3          gtable_0.3.1             blob_1.2.3              
##  [82] XVector_0.40.0           shadowtext_0.1.2         htmltools_0.5.4         
##  [85] fgsea_1.26.0             clue_0.3-64              scales_1.2.1            
##  [88] png_0.1-8                ggfun_0.0.9              knitr_1.42              
##  [91] rstudioapi_0.14          tzdb_0.3.0               reshape2_1.4.4          
##  [94] rjson_0.2.21             nlme_3.1-162             curl_5.0.0              
##  [97] cachem_1.0.6             GlobalOptions_0.1.2      parallel_4.3.0          
## [100] HDO.db_0.99.1            restfulr_0.0.15          pillar_1.8.1            
## [103] vctrs_0.5.2              dbplyr_2.3.0             xtable_1.8-4            
## [106] cluster_2.1.4            evaluate_0.20            cli_3.6.0               
## [109] locfit_1.5-9.7           compiler_4.3.0           Rsamtools_2.16.0        
## [112] rlang_1.0.6              crayon_1.5.2             labeling_0.4.2          
## [115] plyr_1.8.8               stringi_1.7.12           viridisLite_0.4.1       
## [118] BiocParallel_1.34.1      assertthat_0.2.1         munsell_0.5.0           
## [121] Biostrings_2.68.1        lazyeval_0.2.2           GOSemSim_2.26.0         
## [124] Matrix_1.5-1             patchwork_1.1.2          hms_1.1.2               
## [127] bit64_4.0.5              Rhdf5lib_1.22.0          KEGGREST_1.40.0         
## [130] highr_0.10               igraph_1.4.3             memoise_2.0.1           
## [133] bslib_0.4.2              ggtree_3.8.0             fastmatch_1.1-3         
## [136] bit_4.0.5                downloader_0.4           gson_0.1.0              
## [139] ape_5.7-1